ABSTRACT


The Monterey County Health Department (MCHD) in California uses the Early
Aberration Reporting System (EARS) to monitor emergency room and clinic data for
biosurveillance, particularly as an alert system for various types of disease outbreaks. The
flexibility of the system has proven to be a very useful feature of EARS; however, little
research has been conducted to assess its performance. This thesis presents a quantitative
analysis based on modifications to EARS’ internal logic and algorithms.
Logic is used as a counting tool for potential cases of outbreak, and the Early Event
Detection (EED) algorithms are used to determine whether or not an outbreak is about to
occur. The EED methods are compared by assessing their ability to detect the presence of
a known H1N1 outbreak in Monterey County. This research found the cumulative sum
(CUSUM) detection method to be the most reliable in signaling the H1N1 outbreak,
across all combinations of logic explored.
EXECUTIVE SUMMARY


The Monterey County Health Department (MCHD) in California uses the Early
Aberration Reporting System (EARS) to monitor emergency room and clinic data for
biosurveillance, particularly as an alert system for various types of disease outbreaks,
both natural and man-made (e.g., bioterrorism). The concept behind Early Event
Detection (EED) is to quickly detect departures from normal trends so that public
health authorities can take appropriate action. In particular, the intention is to
expedite the detection of disease outbreaks by using data based on pre-diagnosis
“syndromes,” with the hope that, at least for some outbreaks, the signal in the data
will be strong enough for a statistical algorithm to detect the outbreak before the
first case is diagnosed by a medical professional.

Monterey County’s EARS system uses data from six public clinics and four
hospitals located throughout the county. The data received from MCHD include the date
of the patient visit; the patient's age, sex, and home ZIP code; the chief complaint;
and, for the clinics only, the diagnosis code. Daily counts for various syndromic
categories are calculated based on the presence or absence of key words in the chief
complaints. For example, existence of either the word “flu” or the phrase “fever and
cough” in an individual's chief complaint would result in that individual being
included in the Influenza-Like Illness (ILI) syndrome count for that day.
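
The keyword rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the EARS implementation: the function names `is_ili` and `daily_ili_counts` are hypothetical, the phrase “fever and cough” is interpreted (as an assumption) to mean that both words appear somewhere in the complaint, and the sample dates and the “FLU SYMPTOMS” complaint are invented for the example.

```python
import re
from collections import Counter


def is_ili(chief_complaint: str) -> bool:
    """Return True if a complaint matches the illustrative ILI keyword rule."""
    # Split the free text into lowercase alphabetic tokens.
    words = set(re.findall(r"[a-z]+", chief_complaint.lower()))
    # Rule from the text: the word "flu", or "fever" together with "cough".
    # (Assumption: the two words may appear anywhere in the complaint.)
    return "flu" in words or {"fever", "cough"} <= words


def daily_ili_counts(records):
    """Count ILI-flagged records per visit date.

    `records` is an iterable of (visit_date, chief_complaint) pairs.
    """
    counts = Counter()
    for visit_date, complaint in records:
        if is_ili(complaint):
            counts[visit_date] += 1
    return counts


# Hypothetical records; the dates are invented for illustration.
records = [
    ("2009-05-01", "F/U ASTHMA, FEVER AND COUGH"),
    ("2009-05-01", "FU OB"),
    ("2009-05-02", "fever x3 days cough"),
    ("2009-05-02", "FLU SYMPTOMS"),
    ("2009-05-02", "FEVER, WHEEZING"),
]
counts = daily_ili_counts(records)
print(dict(counts))
```

Matching on whole tokens rather than substrings keeps an abbreviation such as “FU” (follow-up) from being mistaken for “flu”; a real system would also need the alias and misspelling lists discussed later in the thesis.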

The flexibility of the system has proven to be a very useful feature of EARS;
however, little research has been conducted to assess the performance of EARS.
Specifically, are there any changes that can be made to EARS’ logic and/or settings that
would maximize the system’s ability to detect disease outbreaks? Also, how do these
changes affect EARS’ ability to detect a particular, known outbreak, such as the novel
2009 H1N1 virus?




To answer these questions, a quantitative comparison was conducted by
implementing modifications to EARS’ logic and assessing the effect on daily counts,
which are one of the key measures used by EARS to monitor for outbreaks. Logic
modifications were compared by evaluating counts of the ILI syndrome over a one-year
period.

As shown in Figure 1, out of 153,696 total patient records from August 1, 2008 to
July 31, 2009, the logic encoded in the unmodified EARS system (the “Base Case”)
flagged 9,093 records for the ILI syndrome.


[Figure 1 chart: “Qualitative Comparisons, Aug 1, 2008-July 31, 2009”]

Figure 1. Results of changes to EARS logic, symptom aliases, and syndrome
definitions, as applied to MCHD data from August 1, 2008-July 31, 2009

The second tier in Figure 1 illustrates the number of ILI syndrome counts when
EARS’ symptom aliases and syndrome definitions are modified. Specifically, the
“Variant 1a” box on the left is based on an expanded ILI syndrome definition as well as a
more robust symptom alias list used by MCHD. Variant 1a modifications to the EARS
logic resulted in a 53% increase in the number of records flagged for the ILI syndrome.
In comparison, the box on the right labeled “Variant 2a” used the restrictive symptom
alias list and restrictive syndrome definitions that were subsequently employed, which
reduced the number of records flagged for ILI by 92% relative to the original “Base Case.”
This illustrates the dramatic impact that changes in EARS logic can have on the daily
syndrome counts.

The bottom tier of Figure 1 illustrates the effects of changing the text-matching
logic, which resulted in similarly large swings in the number of coded ILI syndromes.
For example, the only difference between “Variant 1a” and “Variant 1b” is the change in
text-matching logic, which results in a 62% decrease (13,956 down to 5,414) in the
number of records flagged for ILI.
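
The relative changes quoted for the logic variants follow from simple percent-change arithmetic on the record counts given in the text. The short sketch below, with a hypothetical helper named `percent_change`, reproduces them to within rounding: roughly a 53% increase from the Base Case to Variant 1a, and a decrease of a little over 61% from Variant 1a to Variant 1b, close to the 62% figure quoted above.

```python
def percent_change(before: int, after: int) -> float:
    """Signed percent change from `before` to `after`."""
    return 100.0 * (after - before) / before


# Record counts taken from the text.
base_case = 9_093    # unmodified EARS logic
variant_1a = 13_956  # expanded definitions and alias list
variant_1b = 5_414   # Variant 1a with different text-matching logic

increase = percent_change(base_case, variant_1a)
decrease = percent_change(variant_1a, variant_1b)
print(f"Base Case -> Variant 1a: {increase:+.1f}%")
print(f"Variant 1a -> Variant 1b: {decrease:+.1f}%")
```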

Using the various ILI counts that result from the logic variants, EARS
performance was assessed by determining the system’s ability to detect a known
outbreak. To evaluate this, the ILI counts produced by the Base Case, Variant 1a, and
Variant 2a logic were used as inputs into the EARS system. In addition to the
modified logic, alternative EED methods based on the cumulative sum (CUSUM) were
tested. Lastly, all methods were compared by assessing their ability to detect the
presence of a known H1N1 outbreak in Monterey County.

The CUSUM EED method proved the most reliable at signaling alarms prior to
and throughout the time when Monterey County was experiencing H1N1 cases.
Currently, EARS does not utilize the CUSUM algorithms. When testing the current EED
methods, Variant 2a logic was shown to have the best performance in terms of signals
triggered prior to an outbreak. Surprisingly, under the original and Variant 1a sets of
logic, the EARS methods were of little to no value in signaling an outbreak.
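
For readers unfamiliar with the CUSUM, the following is a minimal sketch of the textbook one-sided (upper) CUSUM applied to daily counts. It is not necessarily the variant evaluated in this thesis: the function name `cusum_signals`, the in-control mean `mu0`, the reference value `k`, the threshold `h`, and the sample counts are all assumptions chosen for illustration.

```python
def cusum_signals(counts, mu0, k, h):
    """One-sided (upper) CUSUM over a sequence of daily counts.

    The statistic is S_t = max(0, S_{t-1} + x_t - mu0 - k), and a signal
    is raised on any day where S_t > h. The statistic is not reset after
    a signal in this sketch. Returns one (statistic, signaled) pair per day.
    """
    s = 0.0
    results = []
    for x in counts:
        s = max(0.0, s + x - mu0 - k)
        results.append((s, s > h))
    return results


# Hypothetical daily ILI counts: stable at first, then a sustained rise.
daily_counts = [20, 22, 19, 21, 20, 26, 27, 29, 31, 30]
results = cusum_signals(daily_counts, mu0=20.0, k=2.0, h=10.0)

# Index of the first day on which the CUSUM exceeds the threshold.
first_alarm = next(
    (day for day, (_, flagged) in enumerate(results) if flagged), None
)
print(first_alarm)
```

Because the statistic accumulates small excesses over several days, a sustained moderate rise can trigger a signal even when no single day looks extreme on its own.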
I. INTRODUCTION


A.  EARLY  EVENT  DETECTION  (EED) 

Health  emergencies  can  either  be  naturally  occurring  (e.g.,  influenza),  accidental 
(e.g.,  fire-related  illnesses),  or  intentional  (e.g.,  bioterrorism).  Given  the  possible  life- 
threatening  nature  of  these  situations,  decision  makers  require  timely  diagnosis  and 
reporting  to  reduce  the  negative  impact  to  public  health.  The  concept  behind  Early  Event 
Detection  (EED)  is  to  quickly  detect  abnormalities  from  the  normal  trends  so  that  public 
health  authorities  can  take  the  appropriate  action  to  deal  with  them.  More  formally,  the 
Centers  for  Disease  Control  and  Prevention  (CDC)  defines  EED  as  “supporting  the  early 
detection  of  health  events  including  determining  and  monitoring  the  size,  location  and 
spread  of  health  events,  and  providing  situational  awareness  to  assist  in  the  investigation 
and  management  of  health  events”  (CDC,  2006).  This  research  evaluates  EED  methods 
found  within  a  specific  biosurveillance  system  known  as  the  Early  Aberration  Reporting 
System (EARS) on actual H1N1 flu data from Monterey County, California.

B. BIOSURVEILLANCE

Shmueli and Burkom (2010) define biosurveillance as “the practice of monitoring
data to detect, investigate, and respond to disease outbreaks.” Homeland Security
Presidential Directive 21 (HSPD-21, 2007) further defines biosurveillance as:

...the  process  of  active  data-gathering  with  appropriate  analysis  and 
interpretation  of  biosphere  data  that  might  relate  to  disease  activity  and 
threats  to  human  or  animal  health — whether  infectious,  toxic,  metabolic, 
or  otherwise  and  regardless  of  intentional  or  natural  origin — in  order  to 
achieve  early  warning  of  health  threats,  early  detection  of  health  events, 
and  overall  situational  awareness  of  disease  activity. 

Before the late 1990s, traditional biosurveillance generally took a retrospective
approach for determining the cause of disease outbreaks. Such outbreaks were generally
identified only after one or more patients had been diagnosed by a medical professional
and then subsequently reported to the appropriate public health authorities. After
diagnostic medical and public health data had been collected and analyzed on a disease,
it would sometimes take weeks or months to report the findings. The problem with
such delayed reporting is that it is more difficult for medical and public health decision
makers to take mitigating measures, such as establishing quarantines for infected
individuals and/or regions.

Modern biosurveillance systems are intended to drastically shorten the time it
takes to analyze and report data of interest, with the goal of facilitating the proactive
detection and management of outbreaks. By using less specific, aggregated, syndromic
data, modern biosurveillance systems can now search for earlier outbreak signals, often in
advance of an actual case being diagnosed, which may lead to more successful public
health interventions (Shmueli & Burkom, 2010).

While  biosurveillance  systems  are  most  often  used  to  detect  and  monitor  natural 
diseases,  they  can  also  be  used  to  detect  bioterrorism  events.  Bioterrorism,  as  defined  by 
Evans  (2010),  “refers  to  the  intentional  release  of  organisms  that  can  cause  sickness  or 
death.”  Shmueli  and  Burkom  (2010)  caution  that  complications  can  arise  from  the 
intended  dual  use  of  biosurveillance  systems  for  detecting  natural  outbreaks  and 
bioterror-related  illnesses.  Specifically,  it  is  difficult  to  define  “normal  behavior,”  from 
which  to  derive  appropriate  baseline  information,  for  both  purposes.  For  example,  if  a 
bioterrorism  pathogen  such  as  tularemia  were  released  during  peak  flu  season,  a  dual-use 
biosurveillance  system  may  not  be  able  to  detect  the  bioterrorism  attack.  While  the  issue 
of  dual-use  biosurveillance  systems  is  beyond  the  scope  of  this  research,  it  is  likely  to 
play  a  continuing  and  significant  role  in  the  detection  and  monitoring  of  disease 
outbreaks (Fricker, Hegler & Dunfee, 2010).

Figure 2 illustrates that since 2001, the U.S. government has spent substantial
resources on preparing the nation against a bioterrorist attack, including a proposed
increase in funding of $271.3 million in the President’s FY2011 budget (Franco, 2010).
In 2004, President Bush’s Project BioShield sought to address the challenges of potential
chemical, biological, radiological, and nuclear (CBRN) terrorism attacks. For more
information on Project BioShield, refer to the Congressional Research Service Report for
Congress (Gottron, 2009).
[Figure 2 chart legend: BioShield Funds; Total USG Civilian Biodefense Funding (minus BioShield funds)]

Figure 2. Civilian Funding of Biodefense by Fiscal Year, FY2001-FY2011, in $millions
(From Franco & Sell, 2010)

1. Syndromic Surveillance

This research assesses the performance of three syndromic surveillance EED
methods that are implemented by EARS. Figure 3 illustrates how this type of
surveillance fits into the broader category of biosurveillance. To begin, epidemiologic
surveillance addresses biosurveillance as it applies to human beings. Even more
specialized, syndromic surveillance is defined as “the ongoing, systematic collection,
analysis, interpretation, and application of real-time (or near-real-time) indicators of
diseases and outbreaks that allow for their detection before public health authorities
would otherwise note them” (Sosin, 2003). Table 1 provides a more detailed comparison
of the various types of data used in biosurveillance, epidemiologic surveillance, and
syndromic surveillance. Notice that syndromic surveillance uses the least medically
specific data, which are often derived from the symptoms that people describe to
hospitals or clinics, otherwise known as chief complaints (Fricker et al., 2010).
Figure 3. Illustration of the various subsets of biosurveillance, to include epidemiologic
surveillance and syndromic surveillance.


Category of Data              Biosurveillance   Epidemiological   Syndromic
                                                Surveillance      Surveillance
Pre-diagnosis
  - ER chief complaint              X                 X                X
  - OTC medicine sales              X                 X                X
  - EMS call rates                  X                 X                X
  - Absenteeism records             X                 X                X
  - Lab results                     X                 X                X
  - Other                           X                 X                X
Medical diagnoses                   X                 X
Lab results                         X                 X
Water and air monitoring            X
Zoonotic                            X
Agricultural                        X

Table 1. Comparison of the categories of data used in biosurveillance, epidemiologic
surveillance, and syndromic surveillance (From Fricker, 2010)

Figure 4 illustrates the EED improvements that medical and public health
communities hope to achieve with a biosurveillance system. In particular, the intention is
to expedite the detection of disease outbreaks by using data based on pre-diagnosis
“syndromes,” with the hope that at least for some outbreaks there will be a sufficiently
strong signal in the data that the outbreak can be detected using a statistical algorithm in
advance of the first case diagnosis by a medical professional. As defined by the
International Foundation for Functional Gastrointestinal Disorders (IFFGD), a syndrome
is “a set of symptoms or conditions that occur together and suggest the presence of a
certain disease or an increased chance of developing the disease” (IFFGD, 2010).


Figure 4. Illustration of how biosurveillance is intended to improve Early Event
Detection (EED) and Situational Awareness (SA) (From Fricker, 2010)

Some examples of Monterey County syndromes are listed in Table 2, while Table
3 offers examples of actual chief complaints taken from Monterey County clinic data. As
described in Fricker et al. (2010):

Syndromes  are  frequently  derived  from  emergency  room  chief  complaint 
data.  A  chief  complaint  is  a  brief  summary  of  the  reason  or  reasons  that 
an  individual  presents  at  a  medical  facility.  Written  by  medical  personnel, 
chief  complaints  are  couched  in  jargon,  acronyms,  and  abbreviations  for 
use  by  other  medical  professionals.  To  distill  the  chief  complaints  down 
into  syndrome  indicators,  the  text  is  searched  and  parsed  for  key  words, 
often  of  necessity  including  all  the  ways  a  particular  key  word  can  be 
misspelled,  abridged,  and  otherwise  abbreviated. 


Fever
Gastrointestinal (GI)
Hemorrhagic
Lesion
Neurological
Upper-respiratory
Lower-respiratory
Influenza-like Illness (ILI)

Table 2. Typical syndromes used in syndromic surveillance systems.

A biosurveillance system has four main components: data collection, data
management, analysis, and reporting (Fricker et al., 2010). Mandl, Overhage, Wagner,
Lober, Sebastiani, Mostashari, and Pavlin (2004) provide the following discussion and
guidance about how to implement these components:

•  Data Collection: Electronically stored data sources are necessary because
they allow for robust syndromic grouping and are typically readily available.
The data received have usually already been collected for other purposes,
because implementing a new collection process is typically deemed cost
prohibitive and administratively taxing. The use of pre-existing database
systems does have the benefit of ensuring the availability of baseline data,
which is important for algorithm development. Public health officials must
then determine which diseases and associated syndromes should be tracked.

•  Data Management: The next step is to acquire and manipulate the data,
which can be done either manually or automatically. Manual acquisition may
require personnel resources from various clinics, hospitals, etc., to transfer data.

•  Analysis: The next step is to logically group the data in a way that
provides useful information. Free-text chief complaints can be grouped into
syndromes, and statistical algorithms can then analyze the data for possible
outbreaks over space and time. Rolka (2006) notes that analysis should be of
“sufficient sensitivity to provide signals within an actionable time frame while
simultaneously limiting false positive signals to a tolerable level.”

•  Reporting:  The  final  step  in  the  biosurveillance  process  is  to  report  the 
findings  to  appropriate  medical  and  public  health  communities.  Having 
sufficient  and  timely  information  is  essential  for  designated  authorities  to  take 
necessary  action,  such  as  conducting  a  public  health  investigation. 
FU ANEMIA
book per angie
fever x3 days cough
FEVER, WHEEZING
VOMITING, FEVER, POSS EAR INFECT
CHDP
WI C/O HA//MM
PAP per Kennedy
BOOK PER MD ROB
COLD
WALK IN BURN TO R-HAND
FU RESULTS
FU OB
SHLDR PAIN FOR 1 WK
WCC 2 MONTH.. ok
4WKFU OB..OVBK
mom and baby
POST PARTUM/BOOK
NP MED REFILLS
VOMITING AND COUGH
F/U ASTHMA, FEVER AND COUGH
chdp
FEVER,PHLEGM
4WK FU OB
FEVER
PAP
COUGH,DXd W/ ASTHMA
CHDP
DEPO
4WK FU OB ..OVBK
ABD PAIN CONJESTION
new born with mom
RASH
FU WT CHECK
PAP
WCC 2 MONTHS
VMX/IC
NP/MUDGE/JR
FU VST URGENT FEET SWOLLEN
PRE-OP VST
walk-in hospital fu
APPT 830 OK PER DR WELL CHILD EXAM

Table 3. Examples of actual chief complaints taken from Monterey County’s clinic data.
(From Hanni, 2009a)

Figure 5 illustrates the four main biosurveillance system components described
above. First, raw health-related data is collected from various sources. Next, the
incoming data is processed into databases by data management experts and software.
Statistical algorithms will then analyze the data for possible outbreaks over space and
time. Lastly, the information must be communicated to medical and public health
communities in order to support EED and SA efforts.
Figure 5. Four main components of a biosurveillance system: data collection, data
management, analysis, and reporting. (From Fricker & Hanni, 2010)

2. Early Aberration Reporting System (EARS) Syndromic Surveillance
System

A number of syndromic surveillance systems are available. EARS, in particular, uses
aberration detection models to identify deviations in current data when compared with a
moving baseline of recent data (Lawson & Kleinman, 2005). EARS was originally
developed by the CDC as a method for monitoring large-scale bioterrorism attacks in
locations with little to no baseline data (e.g., less than 7 days) (CDC, 2010b). For
example, EARS was used for syndromic surveillance at the Democratic National
Convention in 2000, the Super Bowl and World Series in 2001 (Hutwagner, 2003), and
Hurricane Katrina in 2005 (Toprani, 2006). Following the terrorist events of 11
September 2001, EARS has also been used as a routine health surveillance system by
various city, county, and state public health officials.
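
The idea of comparing today's count against a moving baseline of recent data can be sketched as follows. This is a generic illustration rather than the EARS code itself: the function name `moving_baseline_signals`, the 7-day window, the threshold of 3 standard deviations, and the sample counts are assumptions made for the example, not settings taken from the source.

```python
from statistics import mean, stdev


def moving_baseline_signals(counts, window=7, threshold=3.0):
    """Compare each day's count with the mean and standard deviation
    of the preceding `window` days.

    Returns (day_index, z_score, signaled) tuples, starting at the first
    day that has a full baseline.
    """
    results = []
    for t in range(window, len(counts)):
        baseline = counts[t - window:t]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma > 0:
            z = (counts[t] - mu) / sigma
        else:
            # Flat baseline: any increase is treated as an extreme deviation.
            z = float("inf") if counts[t] > mu else 0.0
        results.append((t, z, z > threshold))
    return results


# Hypothetical daily syndrome counts ending with a sharp jump.
daily_counts = [10, 12, 11, 9, 10, 13, 12, 11, 25]
results = moving_baseline_signals(daily_counts)
for day, z, flagged in results:
    print(day, round(z, 2), flagged)
```

Because the baseline is recomputed from only the most recent days, a method of this kind needs very little history, which matches the short-baseline settings EARS was designed for; the trade-off is that a slowly building outbreak can be absorbed into the baseline itself.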


EARS is primarily focused on providing public health care officials with a means
for early event detection (EED). It is important to remember that an EED signal does
not necessarily mean that an outbreak is occurring. Rather, EED provides a signal that an
outbreak may be occurring and potential justification for expending resources to further
investigate. As a by-product of this investigative process, enhanced SA about the
specified syndrome will more than likely be achieved. It is important to note that
biosurveillance systems are not the sole means of detecting an outbreak. It may very well
be the case that clinicians or sentinel physicians will be faster at detecting an outbreak
than a biosurveillance system. Essentially, detection depends on the specific
circumstances and, in some cases, luck (Fricker et al., 2010). Sosin (2003) conveys the
idea that biosurveillance systems can act as a safety net, should the traditional detection
methods, such as clinical diagnosis, fail.

3. EARS and Monterey County Biosurveillance

In 2004, Monterey County staff received training in some of the available
biosurveillance systems. Ultimately, the county decided to use EARS because of the
system’s flexibility and because it allowed the county to keep its data local. In
particular, Monterey County Health Department (MCHD) liked the fact that it could
develop its own syndromes for unique, local circumstances such as agricultural pesticide
spraying and fire-related illness tracking (Fricker & Hanni, 2010).

Figure 6 (a tailored version of Figure 5) illustrates how MCHD implements the
EARS biosurveillance system. Notice that the final reporting step corresponds to the
Daily Observational and Situational Evaluation (DOSE) report, which is updated daily
and posted on the Internet.¹ Figure 7 is an example of the Monterey County DOSE
report. The varying levels of “alert” correspond to a color scheme of green,
yellow, orange, and red. A green block indicates that there were no alert flags from the
previous day (i.e., no health concerns), while a red block indicates multiple alert flags
(i.e., the highest level of concern). Alert colors other than green (i.e., action items) are
usually discussed at the beginning of the report. While EARS is primarily responsible for
analyzing the data collected from clinics, hospitals, and ambulance reports, the DOSE
report also encompasses other data categories as specified in Figure 5.

¹ The most current DOSE report can be found by visiting the MCHD Web site at:
http://www.co.monterey.ca.us/health/healthalerts/pdf/MC_DOSE.pdf


[Figure 6 diagram. Stages: Data Collection, Data Management, Analysis, Reporting. Labeled elements: Clinics, Hospitals, Ambulance transports, Monterey County Health Department, Syndrome Definitions, DOSE Reports]

Figure 6. MCHD implementation of the EARS biosurveillance system (After Fricker &
Hanni, 2010)

Prior to the 2009 H1N1 pandemic, MCHD had gained valuable experience in
using syndromic surveillance to track Influenza-Like Illness (ILI) and to improve
response plans. Local hospitals and clinics also benefitted from having access to these
daily reports, and thus their compliance with MCHD’s data requirements improved.
Once the H1N1 virus began to affect the Monterey County population, these
pre-established relationships helped with mutual response needs, such as planning for and
responding to personal protective equipment (PPE) requests (Fricker & Hanni, 2010).
Figure 7. Illustration of the reporting component of a biosurveillance system, the
Monterey County DOSE Report for Friday, May 28, 2010 (From MCHD, 2010b)
C. 2009 H1N1 “SWINE FLU” VIRUS


Cases of swine flu had been reported in the literature prior to 2009; however, the
infection of two children in southern California in March of that year marked the first
U.S. appearance of the current pandemic. Within a week of the CDC determining that the
strains were genetically similar, there were reports of widespread severe flu activity in
Mexico and even more cases in the United States and possibly Canada. By the second week
after local health departments were alerted, widespread H1N1 activity was reported across
North America and into Europe. This pandemic spread considerably faster than
expected, taking only six weeks to go from a local outbreak to a pandemic (personal
communication with Hanni, 2009).

When the novel H1N1 flu outbreak was first detected in mid-April 2009, the CDC
began working with states to collect, compile, and analyze information. From April 15 to
July 24, 2009, states reported a total of 43,771 confirmed and probable cases of H1N1
infection. Of these reported cases, only 12 percent resulted in hospitalization or death
(CDC, 2010c). As illustrated in Figure 8, the number of cases reported during this
timeframe per 100,000 people was highest among the 5 to 24 year age group (26.7 per
100,000) and lowest in people 65 years and older (1.3 per 100,000) (Hanni, 2009).
Figure 8. Confirmed and probable H1N1 case rate by age group from April 15-July 24,
2009 (From Hanni, 2009)
Figure 9 illustrates the estimated pandemic H1N1 flu hospitalization rate in the
United States by age group from April 15 to July 24, 2009. These estimates are based on
the 4,738 hospitalizations that were reported to the CDC during this time period. The
reported hospitalization rate per 100,000 people was highest among children 0 to 4 years
of age (4.5 children per 100,000) and lowest among people in the 25 to 49 years of age
group (1.1 per 100,000) (personal communication with Hanni, 2009).
Figure 9. Estimated pandemic H1N1 flu hospitalization rate in the United States by age
group from April 15 to July 24, 2009 (From Hanni, 2009)

On July 24, 2009, confirmed and probable case counts were discontinued after the
CDC deemed the virus “widespread” across the United States (CDC, 2010c). In order to
approximate the number of novel H1N1 flu cases in the U.S., a CDC model was developed
that took the number of cases reported by states and adjusted the figure to account for
underestimation. For instance, not all people with the virus sought medical care, and of
those who did, some may not have been specifically tested for H1N1. The CDC model
estimated that more than one million people became ill with novel H1N1 flu between
April and June 2009 in the United States (CDC, 2010c).

Monterey County relates the symptoms of the pandemic H1N1 2009 influenza to
symptoms of regular, seasonal flu. According to the MCHD website as of July 2010,
people usually exhibit one or more of the following symptoms:
•  Fever greater than 100° F
•  Cough
•  Body aches
•  Chills
•  Sore throat
•  Headache
•  Fatigue and/or dizziness
•  Vomiting and diarrhea

Federal agencies working on pandemic influenza planning guidance understood
the effects that H1N1 cases would have on the delivery of patient care. This
understanding, however, had to be balanced with the availability of scarce resources
(Hanfling & Hick, 2009). Perhaps Dr. Hanni (2009) said it best:

Our  surveillance  tools  are  many,  but  we  also  have  potential  for  a  strain  on 
our  resources,  both  staffing  and  supplies  for  the  coming  flu  season 
statewide  and  locally.  We  have  identified  new  risk  groups  that  will  need 
to  be  monitored  and  it  is  apparent  that  we  will  be  needing  to  monitor  for 
several  strains  of  influenza  and  other  respiratory  viruses,  especially  given 
the fact that our vaccine for H1N1 will be available later in the season. To
that  end,  there  have  been  some  changes  in  reporting  requirements  that 
have  also  resulted  in  some  changes  to  what  data  we  are  collecting  locally. 

Monterey County often reviewed the CDC’s guidance for determining if a patient
should be tested for the H1N1 virus. These guidelines, as outlined in Figure 10, were
first developed by the CDC in mid-June 2009 and last updated in mid-September 2009.
During this particular period of increased concern over H1N1, MCHD looked at each
alarm and engaged in daily discussions with the Infection Control Practitioners at their
four hospitals (personal communication with Hanni, 2010). In other words, in order to
achieve sensitivity for detecting H1N1 in Monterey County, MCHD was willing to
tolerate a high false positive rate.

As for diagnosis, the MCHD lab was able to process samples within a few days to
a week, which indicated whether a person had the generic Influenza A virus (i.e., not a
known H1 or H3 virus). A positive result was sufficient to proceed as if the case
were positive for H1N1 while the sample was sent to the state lab, which meant that
physicians and their staffs would wear the appropriate PPE when in a room with that
person. Due to the large influx of samples, however, state lab testing was sometimes
delayed by as much as several weeks (personal communication with Hanni, 2010).
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Clinical  management  ofpatteigs  should  not  be  dela)'ed  for  test  results. 

All  unsolicited  ^lecimens  leceis’ed  by  the  Monterey  County  Public  Health  Laboratory  that  do  not  meet  the 
above  medical  criteria  will  not  be  analyzed. 

Figure  10.  CDC  Testing  Reeommendations  for  Pandemic  (HlNl)  2009  Influenza  Virus, 

last  updated  September  08,  2009  (From  MCFID,  2009) 

Figure 11 shows the cumulative total of confirmed H1N1 cases in Monterey
County from May 10, 2009 to April 5, 2010. With 47 cases, September 2009 saw the
highest number of confirmed H1N1 counts. August and October 2009 were the next
highest months with counts in the mid-thirties.

Figure 11. Number of confirmed H1N1 cases from May 2009 to April 2010 in Monterey
County (From Hanni, 2010)

D.  ORGANIZATION  OF  THIS  THESIS 

Two main components of EARS are evaluated in this research: the
logic and the EED algorithms. As employed before the first H1N1 outbreak, EARS was
deficient in detecting signals in the data. The remaining chapters are organized as
follows. Chapter II describes how modifications to EARS' logic and/or settings can
affect the system's EED performance. Chapter III describes the algorithms used to
evaluate the relative performance of the various EED methods studied against confirmed
cases of the H1N1 virus, as well as a description of how various input and threshold
values were chosen. Chapter IV summarizes the results of the evaluation and makes
recommendations for future EED improvement.

II. SYNDROMES: DEFINING AND CALCULATING DAILY COUNTS

A. BACKGROUND

EARS was originally intended to serve as a drop-in surveillance system; however,
it is increasingly being used as a routine health surveillance system by local public health
departments. Unfortunately, little research has been conducted to verify the EED
performance of EARS; specifically, whether there are any changes that can be made to
EARS' logic and/or settings that would maximize the system's EED performance. To
gain more insight, this chapter describes a quantitative comparison of how modifications
to EARS' logic would affect daily counts (e.g., as outlined in the DOSE report) of the ILI
syndrome. The following three areas of possible logic modifications were explored:
syndrome definitions, symptom aliases, and text matching algorithms.

B. CHIEF COMPLAINT DATA COLLECTION AND REPORTING

Monterey County's EARS system uses data from six public clinics and four
hospitals located throughout the county. The data received from MCHD include the date
of patient visit; the age, sex, and home ZIP code of the patient; the chief complaint; and,
for the clinics only, the diagnosis code. All clinic visits that occurred during the previous work day are
electronically transmitted daily to MCHD. When clinic offices are closed during
weekends and select holidays, data transmission does not occur.

Daily counts for various syndromic categories are calculated based on the
presence or absence of key words in the chief complaints. For example, existence of
either the word “flu” or the phrase “fever and cough” in an individual's chief complaint
would result in that individual being included in the ILI syndrome count for that day.
Refer back to Tables 2 and 3 for examples and follow-on discussion regarding syndromes
and chief complaints.

Dr. Hanni (2009) notes that MCHD tracks influenza in the population by using
reports of ILI from providers and clinics locally, statewide, and nationally. Here, it
is important to identify several crucial dates and events from such reports. In
Figure 12, the y-axis shows the percentage of visits (by week) that are due to ILI,
beginning in October 2008 and ending in October 2009. The dashed horizontal line is the
national baseline, above which the situation is classified as an epidemic.^ Comparing 2009
data (in red) to previous years, Monterey County experienced a relatively mild influenza
season. However, in late April and early May 2009, there was an unexpected increase in
the percentage of outpatients that were being seen for ILI, which continued throughout
the summer and again peaked in late October and early November 2009. Also of note is
that on April 17, 2009, the CDC determined that the current pandemic H1N1 flu virus
was active in the United States (personal communication with Hanni, 2009). On June 11,
2009, the World Health Organization (WHO) declared that H1N1 was a global pandemic
(WHO, 2010).


Figure  12.  Percentage  of  visits  for  ILI  reported  by  the  U.S.  Outpatient  ILI  Network, 
National  Summary  2008-09  and  previous  two  seasons  (From  CDC,  2009) 


^ For a detailed discussion on the ILI national baseline, refer to
http://scienceblogs.com/effectmeasure/2009/09/trying_to_make_sense_of_flu_se.php.

C. CHANGES IN LOGIC

1. Symptom Aliases

In  order  for  EARS  to  produce  meaningful  results,  raw  data  such  as  text-based 
chief  complaints  must  be  turned  into  numerically-coded  variables,  such  as  syndrome 
indicators  or  demographic  variables.  As  an  example,  the  following  pseudo-code  creates 
an  indicator  variable  “ili_ind”  for  the  ILI  syndrome  by  searching  for  keyword  substrings 
within  chief  complaint  text: 

loop from i = 1 to number of data records
    set ILI_ind(i) = 0
    loop from j = 1 to number of ILI keywords
        if ILI_keyword(j) is a substring in chief_complaint(i)
            then ILI_ind(i) = 1


The above example identifies whether any of a set of keywords, or symptom aliases, is
contained in a free-form text block. If so, the record is matched with a
particular syndrome, in this case ILI. Computer logic requires a list of keywords to be
searched for, including abbreviations, acronyms, and common misspellings; however,
caution should be exercised when creating this list. If an acronym is too generic, for
example, specificity may be jeopardized.
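
The pseudo-code above can be sketched in Python as follows. This is only an illustration and not the EARS SAS code: the function name `ili_indicator`, the three-entry keyword list, and the sample chief complaints are all hypothetical. It shows how a short, generic alias such as "ST" can flag unrelated text.

```python
def ili_indicator(chief_complaint, keywords):
    """Return 1 if any keyword occurs anywhere in the chief complaint, else 0."""
    text = chief_complaint.upper()
    return 1 if any(keyword in text for keyword in keywords) else 0


# Hypothetical alias list, for illustration only.
ILI_KEYWORDS = ["COUGH", "FLU", "ST"]

complaints = ["PT COUGHING FOR 4 DAYS", "ST AND FEVER", "ASTHMA", "RASH"]
flags = [ili_indicator(c, ILI_KEYWORDS) for c in complaints]
# "ASTHMA" is flagged only because it contains the letters "ST".
```

Because the test is a plain substring search, every alias added to the list widens the set of unrelated words that can trigger a match.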

To illustrate, the original CDC definition of the ILI syndrome comprises the
following symptoms: “sore throat” or “cold” or “cough.” Table 4 shows a subset of the
actual EARS “symptoms_code” file, which maps keywords to their respective
symptoms.


Keywords        Symptom         Keywords    Symptom    Keywords    Symptom
SROETHROAT      SORETHROAT      COL         COLD       COUGH       COUGH
SSORE THROAT    SORETHROAT      NOSE        COLD       C9UGH       COUGH
ST              SORETHROAT      URI         COLD       CCOUGH      COUGH
TBROAT          SORETHROAT      EAR PAIN    COLD       CIUGH       COUGH
THROAT          SORETHROAT      DISCH       COLD       CKUGH       COUGH
TH40AT          SORETHROAT      OM          COLD       CLUGH       COUGH

Table 4. Subset of the CDC “symptom_code” file, as used by EARS to map keywords to
symptoms (e.g., sore throat, cold, cough)

With the EARS SAS code that MCHD is currently using, if a keyword (as found
in the “symptoms_code” file) is found anywhere within the free-text chief complaint
field, it will match to the respective syndrome. This simplified matching logic can be
problematic, as in the case where “ASTHMA” is mapped to the SORETHROAT
symptom because of the keyword “ST.” Other examples of inappropriate matching
include:

•  “URINARY  INFECTION”  mapping  to  COLD  because  of  keyword  “URI” 

•  “COLPOSCOPY”  mapping  to  COLD  because  of  keyword  “COL” 

•  “MOM  NEEDS  FOLLOW-UP”  mapping  to  COLD  because  of  keyword  “OM” 

The above examples demonstrate that without a more sophisticated text matching
algorithm, the above pseudo-code is subject to spurious false positives. A better approach
might involve setting ili_ind=1 when, for example, “ST” occurs as a separate word in the
chief complaint text (as in “ST AND FEVER”) but not as part of a longer word (as in the
“ASTHMA” example above). The algorithm must also be flexible enough to map root
words found within the text string (e.g., “COUGH” within the text “COUGHING”).

In  addition  to  concerns  associated  with  the  text  coding  algorithm,  the  “symptoms” 
file  itself  must  be  heavily  scrutinized.  For  example,  the  following  inappropriate 
mappings  were  discovered  when  using  the  original  CDC  symptoms  file: 

•  “NOSE  BLEEDS”  maps  to  COLD  because  of  keyword  “NOSE” 

•  “VAGINAL  DISCHARGE”  maps  to  COLD  because  of  keyword  “DISCH” 

•  “4  YEAR  OLD  WCC”  maps  to  COLD  because  of  keyword  “OLD” 

Thus, in an effort to improve the overall specificity of the EARS program, aliases
“NOSE,” “DISCH,” and “OLD” were removed from the “symptoms” file.

2. Syndrome Definitions

Figure 13 illustrates three possible ILI syndrome definitions. According to the
original CDC EARS definition, a record is flagged for ILI when a patient complains of
any one or more of the following symptoms: “sore throat” or “cold” or “cough.” As
shown in Figure 13, MCHD created an expanded definition of the ILI syndrome by
allowing for many more symptom possibilities. In doing so, the goal was to increase the
chances of correctly classifying someone with the flu, though this strategy comes at the
cost of also increasing the number of false positives (i.e., counting those without the flu
in the ILI syndrome).^


ILI Syndrome Definitions

• EARS (CDC)
  - Sore throat or
  - Cold or
  - Cough

• Expanded (MCHD)
  - Cold or
  - Cough or
  - Fever or
  - Chills or
  - Muscle pain or
  - Headache or
  - Flu and not shot

• Restricted (MCHD)
  - Fever & cough
  - Fever & sore throat
  - Fever & cough & sore throat
  - Flu and not shot

Figure 13. Three versions of the ILI syndrome definition: original EARS (CDC),
expanded MCHD, and restricted MCHD

Monterey County currently uses a more restricted definition of the ILI
syndrome, which has substantially lowered the number of records flagged.^ Instead of
simply using one symptom to flag ILI, they now require more “evidence,” in the sense
that to be flagged for ILI someone has to have two or more symptoms, such as fever and cough.
By using the restricted definition, the current goal is to limit the chances of
incorrectly counting an individual in the ILI syndrome who does not actually have the


^  Medical  and  public  health  professionals  would  say  that  the  expanded  definition  is  intended  to 
increase  sensitivity  at  the  cost  of  decreasing  specificity. 

^ “Not shot” under the Restricted (MCHD) ILI syndrome definition means that the word “shot” is not
included in the chief complaint field. This ensures that a chief complaint containing the text “flu shot” will
not be included as an ILI syndrome indicator because of the existence of the word “flu” in the text.

flu, though this strategy comes at the cost of a greater chance of missing some true
positives (i.e., failing to count those with the flu in the ILI syndrome).^
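
The three definitions in Figure 13 can be written as Boolean rules, as in the hedged Python sketch below. The function names and the representation of a record (a set of already-coded symptom names plus the raw complaint text) are assumptions made for illustration. The sketch also assumes that the items listed under the restricted definition are alternatives joined by "or," which the figure implies but does not state.

```python
def flu_not_shot(text):
    """True when the complaint mentions flu but does not contain the word 'shot'."""
    upper = text.upper()
    return "FLU" in upper and "SHOT" not in upper


def ears_cdc(symptoms, text=""):
    """Original EARS (CDC): any one of sore throat, cold, or cough."""
    return bool(symptoms & {"sore throat", "cold", "cough"})


def expanded_mchd(symptoms, text=""):
    """Expanded (MCHD): any one of a longer symptom list, or flu without 'shot'."""
    listed = {"cold", "cough", "fever", "chills", "muscle pain", "headache"}
    return bool(symptoms & listed) or flu_not_shot(text)


def restricted_mchd(symptoms, text=""):
    """Restricted (MCHD): fever plus cough and/or sore throat, or flu without 'shot'."""
    fever_combo = "fever" in symptoms and bool(symptoms & {"cough", "sore throat"})
    return fever_combo or flu_not_shot(text)
```

For a record coded only with "cough," the CDC and expanded rules flag ILI while the restricted rule does not, which is the sensitivity versus specificity trade-off described in the text.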

To illustrate the implications of different syndrome definitions, Figure 14 shows
the estimated aggregate ER ILI activity for Monterey County and California since fall
2009. The upper line is the result of using the expanded ILI syndrome definition as
compared to the original CDC definition (sore throat, cold, or cough) used by California
Sentinel Providers. On close inspection, the two plots seem to have similar trends and signals.
This correlation may imply that the choice of definition matters little;
however, the larger cycle in the upper line could also mask a real signal.

Figure 14. Emergency Room Influenza-Like Illness Visits for Monterey County and
California, 2008-2009 Season (From Hanni, 2009)

3.  Text  Matching  Algorithms 

Changes to the EARS text matching logic can also have a large
impact on the number of individuals coded with a given syndrome. In the context of the
ILI syndrome, the original EARS (CDC) text-matching logic works as
follows: if an ILI keyword is found anywhere within the chief complaint field (even in


^ For the restricted definitions, medical and public health professionals would say that it increases
specificity at the cost of decreasing sensitivity.

the middle of a word), then it will be flagged as an ILI indicator. By using such simple
logic, words like “COLPOSCOPY” will be coded as a “COLD” symptom because of the
keyword “COL.” In a similar example, if the letters “MI” map to the “CARDIAC”
symptom, one would not want the word “VOMIT” to be associated with cardiac
symptoms just because the letters “MI” were contained within that word.

In an effort to mitigate inappropriate symptom coding, a more sophisticated
approach would be to revise the text matching logic. Under the proposal known as
“enhanced NPS logic,” the text-matching algorithm only allows a symptom match if the ILI
keyword is at the beginning or end of a word (or matches the word exactly). In addition,
for any keyword that is less than or equal to three letters long, there cannot be any
letters before or after it. Otherwise, it will not count as a symptom indicator. For
example, by using enhanced NPS logic, the keyword “ST” in chief complaint “TEST”
would not be flagged as an ILI indicator, nor would the keyword “COL” in chief
complaint “RE-COLPO” be flagged as an ILI indicator. In both cases, the short
keyword is not a separate word: other characters are attached to it. See Figure 15 for a graphical
depiction of the examples above.^


^ “Red” or dark colored blocks indicate that text-matching logic will prevent keywords from mapping
to symptoms if additional letters or symbols are present. “Powder blue” or light colored blocks indicate
that additional letters or symbols are allowed by text-matching logic in mapping keywords to symptoms.

Enhanced (NPS) Logic

- For short words (≤3 characters)
  • No variations on alias words allowed
  • Example: ST
    - NP FOR HIV TEST PER VERONICA/CHART MADE/LM
  • Example: COL
    - RTN RE-COLPO/LM

- For longer words (≥4 characters)
  • Variations on one side of the alias word are allowed
  • Example: COUGH
    - PT COUGHING FOR 4 DAYS
  • Example: OUGH
    - R/S PREV APPT CALL NOT GOING THROUGH

Figure 15. Examples of how “Enhanced NPS Logic” works for text-matching keywords
with symptoms

For words greater than or equal to four characters using enhanced NPS logic,
variations on the symptom keywords are allowed but only on one side of the word or the
other. For example, the word “COUGH” can have many variations, such as COUGHS,
COUGHED, or COUGHING. In such cases, it is appropriate to flag these variations of
the word “COUGH” as being an ILI indicator. Unfortunately, using the enhanced logic
alone does not necessarily guarantee that all inappropriate mappings will be eliminated.
For example, keyword “OUGH” would allow the following chief complaint to be
inappropriately flagged for ILI: “PREV APPT CALL NOT GOING THROUGH”. This
example also highlights the importance of carefully reviewing the “symptoms” file. See
Appendix A for the R code on implementing “enhanced NPS logic.”
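
As a rough illustration of the rules described above (and not the R code of Appendix A), the following Python sketch applies them to single-word keywords. The function name `enhanced_match` is hypothetical, and splitting the chief complaint on whitespace only, so that symbols such as "-" and "/" stay attached to a word, is an assumption. Multi-word aliases such as "EAR PAIN" are not handled.

```python
def enhanced_match(chief_complaint, keyword):
    """Apply the enhanced matching rules to one single-word keyword.

    Keywords of three letters or fewer must equal a whole word.
    Longer keywords may have extra characters on one side only,
    that is, they must start or end a word.
    """
    kw = keyword.upper()
    for word in chief_complaint.upper().split():
        if len(kw) <= 3:
            if word == kw:
                return True
        elif word.startswith(kw) or word.endswith(kw):
            return True
    return False
```

Under these assumptions, "ST" matches "ST AND FEVER" but not "TEST" or "ASTHMA," "COUGH" matches "COUGHING," and "OUGH" still matches "THROUGH," reproducing the residual false positive noted above.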

4. Summary of Changes in Logic

After exploring the three areas of logic, Figure 16 illustrates the quantitative
results of the analysis performed. The top box in Figure 16 refers to the original CDC
text matching logic, symptom aliases, and syndrome definitions as the “Base Case.”
When running these algorithms together, out of 153,696 total records, 9,093 records
(almost 6%) were flagged as being ILI syndromes.

[Figure 16 image; panel title: Qualitative Comparisons, Aug 1, 2008 - July 31, 2009]

Figure 16. Qualitative results of changes to EARS logic, symptom aliases, and syndrome
definitions, as applied to MCHD data from August 1, 2008-July 31, 2009

In the second tier of Figure 16, the EARS text matching logic remained the same,
while the symptom aliases and syndrome definitions were altered. The box on the left
indicated as “Variant 1a” used the expanded ILI syndrome definition (refer to Figure 13)
as well as a more robust symptom alias list, which has been previously demonstrated to
yield an increase in spurious matches. The expanded aliases and syndrome definitions
resulted in a 53% increase in the number of records flagged for the ILI syndrome. In
comparison, the box on the right labeled “Variant 2a” used the restrictive symptom alias
list and restrictive syndrome definitions, which reduced the number of
records flagged for ILI by nearly 92% relative to the original “Base Case.”

The bottom tier of Figure 16 illustrates the significance of changes to the
text-matching logic, which results in similarly large swings in the number of coded ILI
syndromes. For example, the only difference between “Variant 1a” and “Variant 1b” is
the change in text matching logic, which results in a 62% decrease (13,956 down to
5,414) in the number of records flagged for ILI.

Figure 17 is a visual representation of smoothed ILI counts over time using the
five sets of logic described above. The Base Case (solid black line) and Variants 1a and
1b (dashed lines directly above and below the solid line) appear to be essentially the same
curve shifted up or down, which suggests those variants are simply adding in or
subtracting out some sort of average level of noise. On the other hand, there are some
smaller differences (in terms of the “spikes”), which may turn out to be significant.

Variants 2a and 2b (the two lowest curves) look different from the other three,
not just because of the significantly lower average daily counts, but also because
some of the trends appear to differ. For example, the time series from 1a, 1b, and the Base
Case all show a clear spike between times 0-50, followed by another spike between times
50-100. In contrast, Variants 2a and 2b only seem to show a spike from times 50-100.

[Figure 17 image; plot title: ILI Syndrome]

Figure 17. Smoothed ILI counts from MCHD during August 1, 2008-July 31, 2009
(excluding weekends) using five different combinations of text matching logic,
symptom aliases, and syndrome definitions.

In Chapter III, the ILI counts produced by the Base Case, Variant 1a, and Variant
2a logic will be used as input into EARS and other EED algorithms. Unfortunately, it was
simply not possible to choose the best set of daily counts using a “gold standard.”^ In a
study of International Classification of Diseases 9th Edition (ICD-9) based^ detectors,
Espino and Wagner (2010) measured classification performance
against the human classification of cases and found that:


^  As  defined  by  The  Free  Dictionary  (http://www.thefreedictionary.com/gold+standard),  gold 
standards  are  “the  supreme  example  of  something  against  which  others  are  judged  or  measured.” 

^ As defined by The Free Dictionary (http://medical-dictionary.thefreedictionary.com/ICD-9-CM),
ICD-9 codes are “A standardized classification of disease, injuries, and causes of death, by etiology and
anatomic localization and codified into a 6-digit number, which allows clinicians, statisticians, politicians,
health planners and others to speak a common language, both U.S. and internationally.”

ICD-9-coded diagnoses offer no advantages in positive predictive value
and specificity. However, because the diagnosis codes are significantly
delayed (on average by 7.5 hours), it is clearly the case that detection
systems should focus on the chief complaint data, when it is available.

Since clinics adhere to ICD-9 codes for insurance billing purposes, it would be
plausible to assume that these codes could be used as the “gold standard.” Unfortunately,
there were many obvious issues that made ICD-9 codes unsuitable for use. Specifically,
the code V20 appeared quite often as a catch-all for many seemingly unrelated illnesses
(e.g., flu, women's wellness exams, HPV shots, etc.).^ For these reasons and despite
having ICD-9 codes in the MCHD dataset, it was determined not to use them as a “gold
standard” and instead use documented H1N1 cases. Finally, Chapter IV will compare the
results of various algorithms to see if one or more can do better at detecting a known
H1N1 outbreak in Monterey County.


9 V codes are considered to be ‘Supplementary Classification of Factors Influencing Health Status and
Contact With Health Services,’ where the V20 code is listed as ‘health supervision of infant or child’ under
the subcategory of ‘Persons Encountering Health Services In Circumstances Related To Reproduction And
Development.’

III. EARLY EVENT DETECTION METHODS

This chapter describes the EED methods that are tested in conjunction with the
various sets of logic presented in Chapter II. These EED algorithms include the methods
currently used in the EARS system as well as a CUSUM-based method first described
and evaluated in Fricker et al. (2008). The goal of these methods is to monitor the
syndromic data as it comes into MCHD over time and signal when the data deviates
significantly from trends in the recent past.


A. EXISTING EED METHODS: EARS

As described in Fricker et al. (2008), the current EARS detection methods are
called “C1,” “C2,” and “C3.” The C is likely an abbreviation for the cumulative sum
(CUSUM) methodology from which the EARS documentation claims these methods
were derived. However, as Section B makes clear, this description is incorrect because
none of these methods are actually based on or derived from the CUSUM.


The C1 method calculates a standardized syndrome daily count for day t using the
sample average and sample standard deviation estimated from the previous 7 days of
daily counts:

\[ C_1(t) = \frac{Y(t) - \bar{Y}_1(t)}{S_1(t)} \qquad (1) \]

where

• $Y(t)$ is the observed syndrome count for day t,

• $\bar{Y}_1(t)$ is the sample mean based on the previous 7 days of data,
  $\bar{Y}_1(t) = \frac{1}{7}\sum_{j=t-7}^{t-1} Y(j)$, and

• $S_1(t)$ is the sample standard deviation based on the previous 7 days of data,
  $S_1^2(t) = \frac{1}{6}\sum_{j=t-7}^{t-1}\left[Y(j) - \bar{Y}_1(t)\right]^2$.

As implemented in EARS, the C1 method signals an alarm at time t when the
$C_1(t)$ statistic exceeds a fixed threshold h, which occurs when the observed count exceeds
the sample mean by more than three sample standard deviations: $C_1(t) > 3$.


The C2 method is very similar to C1 but differs in that it uses a 2-day lag before
calculating a standardized value using the previous 7 days' worth of data. Specifically,

\[ C_2(t) = \frac{Y(t) - \bar{Y}_3(t)}{S_3(t)} \qquad (2) \]

where

• $Y(t)$ is the observed count for period t,

• $\bar{Y}_3(t)$ is the moving sample mean,
  $\bar{Y}_3(t) = \frac{1}{7}\sum_{j=t-9}^{t-3} Y(j)$, and

• $S_3(t)$ is the moving sample standard deviation,
  $S_3^2(t) = \frac{1}{6}\sum_{j=t-9}^{t-3}\left[Y(j) - \bar{Y}_3(t)\right]^2$.

The C2 method also signals at time t when the $C_2(t)$ statistic exceeds a fixed
threshold h, which occurs when the observed count exceeds the sample mean by more than
three sample standard deviations: $C_2(t) > 3$.

Lastly, the C3 method combines current and historical data from day t and the
previous two days, and signals an alarm when $C_3(t) > 2$. It calculates the statistic at time t
as follows:

\[ C_3(t) = \sum_{j=t-2}^{t} \max[0,\, C_2(j) - 1] \qquad (3) \]
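
The three statistics can be sketched in Python as follows, assuming `counts` is a list of daily counts indexed by day. This is an illustration of Equations 1 through 3, not the EARS implementation; the helper name `_standardized` and the handling of a zero standard deviation are assumptions.

```python
import statistics


def _standardized(counts, t, lag):
    """Standardize counts[t] against the 7-day window that ends `lag` days before t."""
    start = t - lag - 6
    if start < 0:
        raise ValueError("not enough history before day t")
    window = counts[start : t - lag + 1]
    mean = statistics.mean(window)
    sd = statistics.stdev(window)  # sample standard deviation (divisor 6)
    if sd == 0:
        # Assumption: with a flat baseline, any count above it is treated as an extreme value.
        return float("inf") if counts[t] > mean else 0.0
    return (counts[t] - mean) / sd


def c1(counts, t):
    """C1: baseline is days t-7 .. t-1."""
    return _standardized(counts, t, lag=1)


def c2(counts, t):
    """C2: baseline is days t-9 .. t-3 (a 2-day lag)."""
    return _standardized(counts, t, lag=3)


def c3(counts, t):
    """C3: sum of max(0, C2 - 1) over days t-2 .. t."""
    return sum(max(0.0, c2(counts, j) - 1) for j in range(t - 2, t + 1))
```

With the thresholds stated above, an alarm on day t corresponds to `c1(counts, t) > 3`, `c2(counts, t) > 3`, or `c3(counts, t) > 2`.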

Of particular note and concern, EARS' C1, C2, and C3 algorithms factor in
“zeros” for days when clinics are not open for business (e.g., weekends and holidays).
Given that the sample mean and sample standard deviation are based on the previous 7-9
days' worth of data, exceeding the alarm thresholds will prove difficult, as shown in
Chapter IV.

B. AN ALTERNATIVE EED METHOD BASED ON THE CUSUM

1. Basic CUSUM

The  CUSUM  is  a  statistical  process  control  (SPC)  methodology  often  used  in 
managing  the  quality  of  manufactured  items.  See  Montgomery  (2001)  for  an  introduction 
and  Hawkins  and  Olwell  (1998)  for  a  comprehensive  treatment  of  the  CUSUM.  The 
most  commonly  used  CUSUM  is  of  the  form, 


\[ C_{t+1} = \max[0,\, C_t + Y_{t+1}/\sigma - k], \qquad (4) \]


where $Y_{t+1}$ is the count at time t+1, $Y \sim N(\mu_0, \sigma^2)$ is the desired state of the process,
and k is referred to as the “reference interval.” For the CUSUM designed to detect the
shift in the mean of a normal distribution from $\mu_0$ to $\mu_1$, the reference interval is defined
as


_5 

2  2 


(5) 


where $\mu_1$ is the mean shift that is desired to be detected quickly. Note that this is a
one-sided CUSUM designed to detect increases in the mean. In industrial quality control
applications, two CUSUMs are often employed: one to look for increases in the mean
and another to monitor for decreases. In syndromic surveillance, however, only
employing a single CUSUM (such as Equation 4) to look for increases is appropriate
since detecting decreases in disease incidence is generally not of interest.


2.  CUSUM  Applied  to  Biosurveillance  Data 

In biosurveillance, the data is unlikely to be stationary since syndromic
surveillance daily counts often have uncontrollable systematic effects and trends such as
an annual influenza seasonal cycle and day-of-the-week effects. Yet, the CUSUM of
Equation 4 and its use in quality control is based on the assumption of stationary data. It
is therefore inappropriate to apply the CUSUM directly to biosurveillance data.

Instead, per Montgomery (2001), an appropriate approach is to model the
systematic trends of the data and then apply the CUSUM to the prediction errors from the
model. Fricker et al. (2008) found the “adaptive regression with sliding baseline” of
Burkom (2007) to be an effective modeling technique for removing systematic trends in
syndromic surveillance data, and this work will use the same approach.

As described by Fricker et al. (2008), the basic idea behind adaptive regression is
as follows. Let $X_t$ be the observation on day t, say the number of individuals presenting at a
clinic or emergency room with a particular syndrome. For t > n, where n (the “baseline
period”) is some fixed number of time periods, model the most recent n daily syndrome
counts, $X_t, X_{t-1}, \ldots, X_{t-n+1}$, as a linear function of time relative to day t:

\[ X_i = \beta_0 + \beta_1 \times (i - t + n) + \beta_2 I_{Mon} + \beta_3 I_{Tues} + \beta_4 I_{Wed} + \cdots + \varepsilon \qquad (6) \]

where, for i = t, ..., t-n+1, $\beta_0$ is the intercept term, $\beta_1$ is the slope, the I's are indicator
functions (I = 1 on the relevant day of the week and I = 0 otherwise), and $\varepsilon$ is the error
term to account for random variability. Following the approach of Fricker et al. (2008)
and in spite of the non-normal time series data, the model is fit using ordinary least
squares regression and is re-fit each day by using the most recent n observations as the
sliding baseline.
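
The sliding-baseline fit, and the one-step-ahead prediction it is used for, can be sketched in plain Python as below. This is not the MATLAB code of Appendix B. The function names are hypothetical, and the sketch assumes a 5-day week with `weekdays[i]` coded 0 to 4 for Monday to Friday, four weekday indicators, and Friday as the baseline level. The fit solves the ordinary least squares normal equations directly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def design_row(time_index, weekday):
    """Regression row: intercept, time, and Monday..Thursday indicators (Friday is baseline)."""
    return [1.0, float(time_index)] + [1.0 if weekday == d else 0.0 for d in range(4)]


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict_next(counts, weekdays, t, n):
    """Fit to days t-n+1 .. t (time index 1 .. n) and predict day t+1 (time index n+1)."""
    window = range(t - n + 1, t + 1)
    rows = [design_row(i - t + n, weekdays[i]) for i in window]
    beta = fit_ols(rows, [counts[i] for i in window])
    return sum(b * x for b, x in zip(beta, design_row(n + 1, weekdays[t + 1])))
```

The baseline window must contain every weekday at least once; otherwise the normal equations are singular and the solver fails. Calling `predict_next` once per day, with t advanced by one each time, reproduces the daily re-fit described above.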

Once fit, the model is used to estimate the predicted count for the current day
(t+1),

\[ \hat{X}_{t+1} = \hat{\beta}_0(t) + \hat{\beta}_1(t) \times (n+1) + \hat{\beta}_j(t), \qquad (7) \]

where $\hat{\beta}_0(t)$, $\hat{\beta}_1(t)$, and $\hat{\beta}_j(t)$ are the estimated model coefficients from the regression fit
at time t, and where $\hat{\beta}_j(t)$ is the relevant estimated day-of-the-week coefficient. Given
the daily count at time t+1, $X_{t+1}$, the predicted count is then used to calculate the
prediction error at time t+1,

\[ R_{t+1} = X_{t+1} - \hat{X}_{t+1}, \qquad (8) \]

which is then standardized using the estimated standard deviation of the prediction errors
from the last m time periods, $Z_{t+1} = R_{t+1}/\hat{\sigma}_t$, for
$\hat{\sigma}_t^2 = \frac{1}{m-1}\sum_{i=t-m+1}^{t}\left(R_i - \bar{R}\right)^2$, so that $Z_{t+1} \sim N(0,1)$.

For biosurveillance, the CUSUM of Equation 4 thus becomes

\[ C_{t+1} = \max[0,\, C_t + Z_{t+1} - k]. \qquad (9) \]
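
A minimal Python sketch of the standardization step and the recursion in Equation 9 follows. The function names are hypothetical. The use of the sample standard deviation, the alarm rule "statistic greater than a threshold h," and the choice not to reset the statistic after an alarm are assumptions made for illustration; k and h must be supplied by the user.

```python
import statistics


def standardized_residual(residuals, t, m):
    """Divide the prediction error for day t+1 by the sample SD of the errors on days t-m+1 .. t."""
    sd = statistics.stdev(residuals[t - m + 1 : t + 1])
    return residuals[t + 1] / sd


def cusum(z_values, k, h):
    """Run the one-sided recursion C <- max(0, C + z - k) and flag days where C exceeds h."""
    c = 0.0
    path, alarms = [], []
    for z in z_values:
        c = max(0.0, c + z - k)
        path.append(c)
        alarms.append(c > h)
    return path, alarms
```

For example, `cusum([0.0, 1.5, 1.5, -3.0], k=0.5, h=1.5)` accumulates to 2.0 on the third day and then returns to zero, so only the third day is flagged.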

3.  Application  of  the  CUSUM  to  MCHD  Data 

Before applying Equation 9 directly to the MCHD dataset, the assumption of
normality of the standardized residuals was assessed using a “historical” set of data.
Using this subset of data, a baseline period of n = 35 days, and an additional m = 10 days,
standardized residuals were calculated from an adaptive regression. See Appendix B for
the MATLAB code used to fit the adaptive regressions and calculate the standardized
residuals. Normal quantile-quantile (Q-Q) plots demonstrated that the standardized
residuals were reasonably normally distributed. See Appendix C for Q-Q plots and
standardized residual plots of the historical data set.

Although Fricker et al. (2008) and Burkom (2007) used an 8-week sliding
baseline (n = 56 days, based on a 7-day week), preliminary research on the MCHD data
found a 7-week sliding baseline (n = 35 days, based on a 5-day week) to be preferred
across all algorithm variants. This preference stemmed from evaluations of the Q-Q plots
of the residuals, where for smaller and larger n the residuals for some of the algorithm
variants began to show departures from normality. Fricker et al. (2008) caution that,
depending on the particular outbreak of interest, there is a trade-off between the amount
of historical data used and the predictive accuracy of the model.

In circumstances where it is important to detect a mean increase quickly, such as
when MCHD was on high alert for H1N1, one might reasonably want to detect a one
standard deviation increase in the mean. Given that $Z_{t+1} \sim N(0,1)$, $\mu_0 = 0$ and thus the
reference interval from Equation 5 for the standardized residuals becomes

The period from August 8, 2008 to January 12, 2009 equals approximately 1/3 of the entire MCHD data set (or
119 days’ worth).
For the purposes of this thesis, the k for detecting this magnitude of shift in the mean is
called an “aggressive” reference interval where, for n = 35, k = 0.56.

Besides choosing the reference interval, implementing the CUSUM also requires
choosing a threshold h. The choice of threshold is based on the Average Time to False
Signal (ATFS) metric where, assuming the CUSUM is re-set after signals, the ATFS is
the average time between false signals. [11] Thus, the threshold is set to achieve the
smallest ATFS that can be reasonably accommodated, given the finite resources available
to further investigate the resulting rate of false positive signals. In the case where MCHD
was on high alert for H1N1 symptoms, perhaps an ATFS of only a few days would
be acceptable. Here, an “aggressive” ATFS was set to be 5 days. [12]

In order to determine the threshold h based on a known reference interval k and a
known ATFS, Hawkins and Olwell (1998) recommend using Siegmund’s approximation

$$\mathrm{ATFS} \approx \frac{\exp[2k(h + 1.166)] - 2k(h + 1.166) - 1}{2k^2}. \qquad (11)$$
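As a quick numerical illustration, the standard in-control form of Siegmund’s approximation for a one-sided CUSUM on standard normal observations can be evaluated directly. The function name `siegmund_atfs` is a hypothetical label for this sketch, and the parameter values in the usage line are the “aggressive” values quoted in this chapter.

```python
import math


def siegmund_atfs(k, h):
    """Siegmund's approximation to the in-control ATFS of a one-sided CUSUM
    applied to standard normal observations with reference interval k and
    threshold h."""
    b = h + 1.166
    return (math.exp(2 * k * b) - 2 * k * b - 1) / (2 * k ** 2)


print(round(siegmund_atfs(0.56, 0.296), 2))
```

For k = 0.56 and h = 0.296 the approximation gives a value of roughly 4, compared with the simulated ATFS of 5 reported for the same parameters, which shows the size of the discrepancy at small ATFS values.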


This approximation is not accurate for small ATFS values, so the threshold
was instead estimated via simulation. Appendix D contains the R code for the
simulation. For a given k, one iteratively runs this simulation for various values of h,
starting with a small number of iterations (the variable “x”). As h gets closer to producing
the desired ATFS, one increases “x” until the standard error becomes small enough. Under the
“aggressive” CUSUM parameters for looking at H1N1 cases, using h = 0.296 with k =
0.56 achieves an ATFS = 5 (s.e. = 0.0045).
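The simulation idea can also be sketched in Python. This is a loose port of the R function in Appendix D, not the thesis code itself; the name `estimate_atfs` and the `seed` argument are assumptions added so the example is reproducible.

```python
import random
from statistics import mean, stdev


def estimate_atfs(runs, h, k, seed=None):
    """Estimate the in-control ATFS and its standard error by simulation.

    Each run feeds standard normal draws into the CUSUM, starting from zero,
    and counts the observations until the statistic first reaches h.
    """
    rng = random.Random(seed)
    lengths = []
    for _ in range(runs):
        c, steps = 0.0, 0
        while c < h:
            c = max(0.0, c + rng.gauss(0.0, 1.0) - k)
            steps += 1
        lengths.append(steps)
    return mean(lengths), stdev(lengths) / len(lengths) ** 0.5


atfs, se = estimate_atfs(20000, h=0.296, k=0.56, seed=1)
print(atfs, se)
```

With far fewer runs than the one million used in Appendix D, the estimate should land near 5 but with a visibly larger standard error, which is the trade-off described above when choosing the number of iterations.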


[11] In SPC terminology, the ATFS when the biosurveillance algorithm is reset is functionally
equivalent to the Average Run Length (ARL).

[12] Five days is one full week (Monday-Friday) since clinics are not open during the weekends.
For comparison purposes, this thesis also looked at more “routine” parameters for
monitoring the ILI syndrome with ATFS = 20 and k = 1.06 (an approximate 2σ shift in the
mean). Using the simulation found in Appendix D and under these “routine” parameters,
h = 0.62. See Table 5 for a summary of the two CUSUM algorithm parameters, and note
their labels: CUSUM 1 (aggressive) and CUSUM 2 (routine).


                            Parameters
Type            Label       ATFS    k       h
"Aggressive"    CUSUM 1     5       0.56    0.296
"Routine"       CUSUM 2     20      1.06    0.62

Table 5.  CUSUM parameters used in MATLAB code for monitoring the ILI syndrome
IV.  RESULTS


This chapter summarizes the detection performance of the EARS (C1, C2, and C3)
and CUSUM (aggressive and routine) algorithms against actual H1N1 cases observed in
Monterey County from September 30, 2008 to October 29, 2009. The subsections are
organized by the ILI counts produced by the various sets of logic (e.g., Base Case,
Variant 1a, and Variant 2a) as described in Chapter II.

A.  ILI COUNT COMPARISONS

The time periods (e.g., seasonal flu, 1st H1N1 wave, 2nd H1N1 wave) labeled in
Figure 18 refer to the national ILI outpatient trends in Figure 12 from September 30,
2008 to October 29, 2009, with an overlay of the smoothed MCHD ILI count
comparisons of the Base Case, Variant 1a, and Variant 2a logic. Specifically, the
national 2009 seasonal influenza outbreak began a steady upward trend in late
January 2009 and peaked in late February into early March 2009. The vertical line above
the date 3/2/09 in Figure 18 represents the end of the “historical” period, with the first
possible EED signal occurring on March 2, 2009.


September  30,  2008,  to  February  27,  2009. 
[Figure: Smoothed ILI Counts]

Figure 18.  Smoothed ILI count comparisons of Base Case, Variant 1A, and Variant 2A
logic from September 30, 2008 to October 29, 2009

The next period of interest is the first H1N1 flu wave (April 16 - June 10, 2009),
which is depicted by the middle two dashed vertical lines in Figure 18. After June 10,
an upward trend continued throughout the summer and worsened at the time when
children across the nation were going back to school (late August-early September). As
for the second H1N1 flu wave, the solid, right-most vertical line in Figure 18 represents
the beginning (September 1, 2009) of that upward ILI count trend at the national level.


For  a  detailed  explanation  of  2009  national  ILI  trends  and  timeline,  refer  to  discussion  in  Chapter  II 
Section  B,  “Chief  Complaint  Data  and  Collection  Reporting.” 
The overall trends for the three sets of logic in Figure 18 are fairly similar, with
the exception of the H1N1 wave period. Notice how the Base Case trend line is
downward sloping for that period, while Variants 1a and 2a show a convincing increase
towards the end of the dataset (in alignment with the increasing national average). Given
the small scale of Figure 18 for Variant 2a, the dramatic increase is even more intriguing
when compared to the Base Case, which could possibly imply that the larger counts were
masking the true signal (e.g., more ILI cases).

B.  ORIGINAL CDC LOGIC (BASE CASE)

Figure 19 compares actual Monterey County H1N1 cases with the results of five
EED algorithms under the original CDC “Base Case” logic. Notice that Figure 19 has
some features in addition to those of Figure 18. Of note, the small circles on the graph
represent the aggregated daily flu counts for Monterey County on specific days. For
example, the circle at the first peak above 11/3/08 indicates there were 50 aggregated
flu counts for that day. Also, note that the “|” marks corresponding to each of the five
EED algorithms indicate that the algorithm’s threshold was exceeded on that particular day
(i.e., the algorithm signaled an alarm).
[Figure: Original CDC Logic (Base Case)]

Figure 19.  Comparison of actual Monterey County H1N1 cases with the results of five
EED algorithms under the original CDC logic (Base Case), from September 30,
2008 to October 29, 2009

Surprisingly, the C1, C2, and C3 methods were of little to no value in signaling an
outbreak, unlike CUSUM 1 and CUSUM 2 (aggressive and routine parameters,
respectively), which performed much better at signaling alarms prior to the first H1N1
case on May 10. Ultimately, CUSUM 1 proved to be the best EED algorithm because
it signaled consistently for 11 straight days up until the first actual case. On the other
hand, CUSUM 1 also signaled between the two H1N1 waves (i.e., during the summer months)
when the smoothed daily counts appear nearly flat, which makes those signals look like
false positives. Yet, despite the seemingly flat trend line, the CUSUM 1 alarms do
correspond to the high number of actual H1N1 cases in Monterey County during that
same timeframe. Table 6 breaks down the number of alarms corresponding to each EED
algorithm by time period, under “Base Case” logic.

Since May 10, 2009 falls on a Sunday, the “case” appears on the graph as Monday, May 11.


                                   Time Period
Algorithm    Seasonal flu    1st H1N1 wave    Summer         2nd H1N1 wave
             (3/02-4/15)     (4/16-6/10)      (6/11-8/31)    (9/1-10/29)
C3           0               1                2              0
C2           0               0                0              0
C1           0               0                1              0
CUSUM 2      1               7                8              0
CUSUM 1      6               13               25             15

Table 6.  Number of alarms corresponding to five EED algorithms under “Base Case” logic
from March 2 to October 29, 2009

C.  EXPANDED MCHD LOGIC (VARIANT 1A)

Figure 20 compares actual Monterey County H1N1 cases with the results of five
EED algorithms under the “expanded” MCHD logic (Variant 1a). Notice that during the
seasonal flu period, the “aggressive” parameters (ATFS = 5, k = 0.56, and h = 0.296) of
CUSUM 1 continued to signal even during the steady decline of ILI counts later in the
season, unlike the EARS methods, which gave no signals. That is, given that the ATFS is set at 5
days, CUSUM 1 is going to signal often, whether or not an outbreak actually exists. On
the other hand, CUSUM 1 is also the most reliable at signaling when there are increases
in the smoothed count line. Given MCHD’s desire for a high-sensitivity EED system,
frequent false alarms were acceptable to decision makers. Table 7
breaks down the number of alarms corresponding to each EED algorithm by time period,
under “expanded” MCHD logic.
[Figure: Expanded MCHD Logic (Variant 1A); vertical axis: Daily Count]

Figure 20.  Comparison of actual Monterey County H1N1 cases with the results of five
EED algorithms under the “expanded” MCHD logic (Variant 1a), from
September 30, 2008 to October 29, 2009
                                   Time Period
Algorithm    Seasonal flu    1st H1N1 wave    Summer         2nd H1N1 wave
             (3/02-4/15)     (4/16-6/10)      (6/11-8/31)    (9/1-10/29)
C3           0               3                3              0
C2           0               0                0              0
C1           0               0                0              0
CUSUM 2      1               8                10             3
CUSUM 1      7               17               23             17

Table 7.  Number of alarms corresponding to five EED algorithms under “expanded”
MCHD logic from March 2 to October 29, 2009

While C3 and CUSUM 2 did in fact signal during the first H1N1 flu wave, it was
CUSUM 1 that signaled consistently for 10 straight days prior to the first H1N1 case on
May 10, 2009. EED methods C1 and C2 failed to signal any alerts for the entire outbreak
period, rendering them completely ineffective for decision makers. Lastly, notice that the
Monterey County ILI trends using the expanded logic do not appear to match the national
trends during the second H1N1 wave (i.e., as the national ILI count went up, MCHD
counts went down). Perhaps the key takeaway here is that the detection methods (EARS
and CUSUM) are working off the syndrome data, so they should signal when that data
shows an increase in ILI counts. Then, separately, the syndrome data should show
“peaks” around actual cases. Ideally, the syndrome data would show the underlying
H1N1 outbreak as evidenced by known cases and the detection methods would signal
given that count increase.

D.  RESTRICTED  MCHD  LOGIC  (VARIANT  2A) 

Figure 21 compares actual Monterey County H1N1 cases with the results of five
EED algorithms under the “restricted” MCHD logic (Variant 2a).

[Figure: Restricted MCHD Logic (Variant 2A)]

Figure 21.  Comparison of actual Monterey County H1N1 cases with the results of five
EED algorithms under the “restricted” MCHD logic (Variant 2a), from September
30, 2008, to October 29, 2009


Notice that during the latter part of the seasonal flu period, CUSUM 1 remained
highly sensitive to “bumps” in the data, yet it also signaled six alarms (from April 28 to
May 5) prior to the first Monterey County H1N1 case on May 10. Alternatively, these
signals may have all been false positives. Nonetheless, all EED algorithms signaled at
least once within 10 days prior to the first H1N1 case, giving credibility to the remaining
algorithms. In fact, CUSUM 2 alarms appear to mimic those of EARS (C1, C2, and C3).
On May 26, all EARS methods signaled prior to the second confirmed H1N1 case on
June 8, whereas the closest CUSUM 1 signal came on May 18 (i.e., far in advance of
the actual outbreak). Table 8 breaks down the number of alarms corresponding to each
EED algorithm by time period, under “restricted” MCHD logic.


                                   Time Period
Algorithm    Seasonal flu    1st H1N1 wave    Summer         2nd H1N1 wave
             (3/02-4/15)     (4/16-6/10)      (6/11-8/31)    (9/1-10/29)
C3           2               5                6              4
C2           2               3                5              2
C1           1               2                4              1
CUSUM 2      1               3                8              6
CUSUM 1      13              9                30             14

Table 8.  Number of alarms corresponding to five EED algorithms under “restricted”
MCHD logic from March 2 to October 29, 2009

Overall, the five EED algorithms performed the best (i.e., signaled with
increasing ILI counts) under the “restricted” logic of Variant 2a, as compared to the
“expanded” logic of Variant 1a and the “original” logic of the Base Case. While
CUSUM 1 is by far the most sensitive, the other methods appear to have signaled at the
leading edge of most CUSUM 1 alarm clusters. This raises the interesting issue of
tradeoffs, in terms of a continuous signal versus an alarm “reset.” That is to say, do the
continuing signals provide additional information about the existence of an outbreak?
While the goal of this research is to highlight the implications of choices of logic, this is a
question best answered by public health officials.

E.  A  CLOSER  LOOK  AT  CUSUM  1 

With CUSUM 1 as the preferred EED method across all sets of logic, Figure 22
compares CUSUM 1 signals during the first national H1N1 flu wave, April 16-June 10,
2009. Notice that prior to the beginning of the flu season, CUSUM 1 signaled across all
sets of logic. Prior to the first H1N1 case, as indicated by the asterisk on May 11,
CUSUM 1 signaled consistently for 11 straight days using the Base Case logic (as
indicated by the black tick marks). Under Variant 1a and Variant 2a logic, CUSUM 1
signaled for seven and six consecutive days, respectively, but quit just three days prior to
the first actual case.

Now turn to Figure 23 to see how CUSUM 1 performed during the active summer
months, starting June 12 and ending September 1, 2009, just prior to the second national
H1N1 flu wave. Here, the smoothed ILI counts for all three sets of logic appear to be on
an upward trend, corresponding to the numerous H1N1 counts during this summer
period. It is hard, however, to visually determine which logic best “matches” the H1N1
cases in Monterey County. In other words, given the CUSUM 1 EED methodology,
there does not appear to be a clear “winner” for which set of logic should be used. Since
identifying the leading edge of an outbreak is usually of most importance to public health
officials, look to the period June 25 - July 14 (indicated by the dashed horizontal lines).
During this time, there were ten confirmed H1N1 cases reported in Monterey County, and
the CUSUM 1 algorithm signaled every day except for one using the Variant 2a
logic (as indicated by the lowest level tick marks). However, there is evidence that the
Base Case data results in more sensitivity, in the sense that for this one comparison, it
signaled four days prior to the restricted case (and thus closer to the first actual case).


The asterisk on July 7, 2009 is representative of three H1N1 cases.
[Figure: CUSUM 1 Comparisons (1st H1N1 Wave); vertical axis: Daily Count]

Figure 22.  CUSUM 1 comparisons across the various sets of logic during the first H1N1
flu wave, April 16-June 10, 2009
[Figure: CUSUM 1 Comparisons (Summer 2009); vertical axis: Daily Count]

Figure 23.  CUSUM 1 comparisons across the various sets of logic during the active
summer months, June 12-September 1, 2009
V.  CONCLUSIONS


This chapter summarizes the conclusions of this research and makes
recommendations for future EARS improvement and follow-on study.

A.  EXERCISE CAUTION WHEN CHANGING LOGIC

EARS’ users, such as MCHD, have the option to modify three areas of logic in
order to alter the performance of the CDC’s original EARS system. While the system’s
flexibility is considered a tremendous benefit to many local health departments, as
evidenced in Chapter II, small changes in logic can have large, poorly understood effects
on the resulting syndrome counts and, hence, the performance of the EARS system. The
results from Chapter II can largely be summarized by Figure 16, which illustrates
smoothed ILI counts using five different combinations of text matching logic, symptom
aliases, and syndrome definitions.

Under the “Base Case,” out of 153,696 total records, 9,093 records (almost 6%)
were flagged as being ILI syndromes. The expanded aliases and syndrome definitions
(Variant 1a) resulted in a 53% increase in the number of records flagged for the ILI
syndrome, while the restricted aliases and syndrome definitions (Variant 2a) resulted in a
92% decrease from the original “Base Case.” In order to figure out which combination of
methods worked best, an attempt was made to compare results with ICD-9 codes. For the
reasons specified in Chapter II and despite having ICD-9 codes in the MCHD dataset, it
was determined not to use them as the “gold standard” and instead to use documented H1N1
cases, as provided by MCHD.

B.  ALGORITHMS  NEED  GOOD  DATA 

A detection algorithm is only as good as the data. Greater emphasis, therefore,
should be placed on improvements in data collection, management, and syndrome
definitions. In other words, biosurveillance systems need quality data and a precise way
to measure that quality (e.g., standards for sensitivity and specificity). Currently, there
does not appear to be a “gold standard” for measuring the accuracy of EARS until after
an outbreak has already occurred, as was the case with this research. It is recommended
that the public health community take the lead in demanding better quality symptom data
from the various sources available.

Perhaps  another  area  for  improvement  lies  in  the  secondary  use  of  chief  complaint 
data  for  the  purposes  of  using  text-matching  algorithms.  Stated  differently,  chief 
complaints  are  serving  a  purpose  for  which  they  were  not  originally  intended.  Ideally, 
one  should  ask  what  information  would  be  most  useful  and  then  do  the  best  at  gathering  it 
rather  than  just  rely  on  what  data  is  collected  by  others.  If  chief  complaint  data  is  going 
to  continue  to  be  used  (and  there  isn’t  much  of  an  alternative),  then  the  text  matching 
logic  and  alias  lists  must  be  improved.  Further,  they  likely  need  to  be  tailored  to  local 
conditions,  conventions,  and  practices. 

In the course of this research, it was also discovered that for Monterey County,
EARS’ C1, C2, and C3 algorithms factor in “zeros” for days when clinics are not open
for business (e.g., weekends and holidays). Given that the sample standard deviation is
based on the previous 7-9 days’ worth of data, it is no wonder that a 3 sigma threshold
fails to signal as often as it should for C1 and C2. Figures 19 and 20 illustrate this point.
It is strongly recommended that these “zero” data points be discarded before
implementing EARS or that the EARS programming logic allow the local user the
flexibility to redefine the workweek from seven days to whatever local conditions dictate.
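The effect of the closed-day zeros can be illustrated with a small Python sketch. It assumes the commonly cited form of the C1 statistic, in which today’s count is compared with the mean and sample standard deviation of a seven-day baseline and a signal is raised when the statistic exceeds 3. The function name `c1_statistic` is hypothetical and the counts are invented purely for illustration; they are not MCHD data.

```python
from statistics import mean, stdev


def c1_statistic(baseline, today):
    """Standardized distance of today's count from a seven-day baseline."""
    return (today - mean(baseline)) / stdev(baseline)


# Hypothetical baselines for the same clinic and the same count today (32).
calendar_days = [21, 20, 0, 0, 23, 21, 20]   # weekend recorded as zeros
open_days = [20, 22, 19, 21, 20, 23, 21]     # last seven days the clinic was open

print(c1_statistic(calendar_days, 32))   # zeros inflate the standard deviation
print(c1_statistic(open_days, 32))       # same spike, much larger statistic
```

In this made-up example the two zeros lower the baseline mean but inflate the standard deviation far more, so the same jump in counts stays below the threshold of 3 with the zeros included and clears it easily once they are dropped.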

C.  RESTRICTED  LOGIC  IS  PREFERRED 

Surprisingly, under “original” and “expanded” sets of logic, EARS methods were
of little to no value in signaling an outbreak. Ultimately, it was CUSUM 1 that proved the
most reliable at signaling alarms prior to and throughout the time when Monterey County
was experiencing H1N1 cases. Given that EARS does not utilize the CUSUM
algorithms, however, it is clear from the results of Figure 21 that the “restricted” logic of
Variant 2a is to be preferred for use by MCHD, at least in comparison to the two other
options evaluated. Of note, EARS signaled at the leading edge of most CUSUM 1 alarm
clusters under these conditions. Given that CUSUM 1 is by far the most sensitive across
all variations in logic, this raises the issue of tradeoffs, in terms of a continuous
signal versus an alarm “reset.” It is reasonable to question whether these continuous signals
provide value-adding information about the existence of an outbreak. While the goal of
this research is to highlight the implications of choices of logic, this is a question best
answered by public health officials.

D.  FUTURE RESEARCH OPPORTUNITIES

It would be interesting to observe how EARS would perform given only the non-zero
data point entries, as discussed in Section B above. One might also want to assess the
performance of EARS for thresholds other than those currently fixed in the program
where, for alternate thresholds, EARS may be able to signal “appropriately.”
Alternatively, one could observe how EARS methods would perform once adjusted for
seasonal trends, such as those found within ILI data.

Finally, more research should be devoted towards exploring the issue of false
positives. As an example, in Chapter IV it was determined that the “aggressive”
parameters of CUSUM 1 (h = 0.296, ATFS = 5, k = 0.56) performed reasonably well
across all three sets of logic: CDC’s original (Base Case) logic, MCHD’s expanded
(Variant 1a) logic, and MCHD’s restricted (Variant 2a) logic. Even after looking closely
at the CUSUM 1 signals during the first H1N1 flu wave and during the peak summer
months, however, there did not appear to be a clear “winner” for which set of logic
should be used. Then again, this research only focused on one defined syndrome, ILI.
Future research is certainly recommended to measure the robustness of these results
across a variety of other syndromes.
APPENDIX A.  R CODE FOR “ENHANCED NPS LOGIC”


Build  a  function  ("finder")  to  look  for  matches  of  "str"  (string)  in  "vec"  (vector).  Note  that  this  is 
more  sophisticated  than  the  SAS  coding  that  MCHD  is  currently  using,  which  only  looks  for  a 
simple  match  anywhere  in  "vec."  Finder,  on  the  other  hand,  only  allows  matches  if  "str"  is  at 
the  beginning  or  end  of  a  word  (or  matches  vec  completely)  for  "str"  longer  than  three 
characters.  It  requires  even  more  restrictive  matching  for  "str"  of  length  three  or  fewer 
characters. 

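finder <- function (str, vec)
{
    # Pattern for a single character that is not a letter
    # (matching below is case-insensitive).
    noletters <- "[^A-Z]"
    # Pad vec with spaces so a match at either end is still bordered
    # by a non-letter.
    vec <- paste (" ", vec, " ", sep="")
    lefty <- paste (noletters, str, sep="")              # str begins a word
    righty <- paste (str, noletters, sep="")             # str ends a word
    shorty <- paste (noletters, str, noletters, sep="")  # str is a whole word
    if (nchar (str) > 3)
        {regexpr (lefty, vec, ignore.case=TRUE) > 0 |
         regexpr (righty, vec, ignore.case=TRUE) > 0 |
         toupper (str) == toupper (vec)}
    else
        {regexpr (shorty, vec, ignore.case=TRUE) > 0 |
         toupper (str) == toupper (vec)}
}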
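For readers who want to experiment with the matching rule outside R, the same idea can be sketched in Python. This is an illustrative rendering rather than the thesis code: it handles a single chief-complaint string instead of a vector, and it escapes the search term with `re.escape`, which is an assumption of this sketch (the R version passes the term to the pattern unescaped).

```python
import re


def finder(term, text):
    """Return True if term matches text under the word-boundary rule:
    terms longer than three characters must begin or end a word;
    shorter terms must match a whole word."""
    padded = f" {text} "          # pad so matches at the ends are bordered
    t = re.escape(term)           # treat the term literally (sketch assumption)
    if len(term) > 3:
        pattern = rf"[^A-Za-z]{t}|{t}[^A-Za-z]"
    else:
        pattern = rf"[^A-Za-z]{t}[^A-Za-z]"
    return re.search(pattern, padded, flags=re.IGNORECASE) is not None


print(finder("fever", "FEVER AND COUGH"))   # long term at the start of a word
print(finder("flu", "influenza"))           # short term buried inside a word
```

The first call matches because "fever" begins a word, while the second does not because a three-letter term such as "flu" must stand alone as a whole word.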
APPENDIX B.  MATLAB CODE FOR CALCULATING CUSUM


%Initialize variables
k = 1.06;  %Placeholder until k can be established
h = .62;  %Placeholder until h can be established
baselinePeriod = 35;
startupPeriod = 45;  %Placeholder for #days of residuals to use
cusum = 0;
alarmCount = 1;

actualData = dlmread('unbiased.CDC1counts.csv');
matX1 = [ones(baselinePeriod,1) (baselinePeriod:(-1):1)'];

dayOfAlarm = (1:1:size(alarmCount));  %Initialize vector to track days in which cusum>=h
today = baselinePeriod;
tomorrow = today+1;

residuals = (1:1:length(actualData));
stdResidualVector = (1:1:length(actualData));

%Calculate residuals
for i=1:1:startupPeriod
    recentData = actualData((today:(-1):(today-baselinePeriod+1)),1);
    matX2 = actualData((today:(-1):(today-baselinePeriod+1)),2:5);
    matX3 = [matX1,matX2];
    b = regress(recentData, matX3);
    predCount = ([1 (baselinePeriod+1), actualData((today+1),2:5)])*b;
    residuals(tomorrow) = actualData((tomorrow),1)-predCount;
    today = today+1;
    tomorrow = today+1;
end

while (today<length(actualData)-1)
    %Record an alarm whenever the CUSUM has reached the threshold
    if (cusum>=h)
        dayOfAlarm(alarmCount) = today;
        alarmCount = alarmCount+1;
    end

    %Re-fit the adaptive regression on the sliding baseline
    recentData = actualData((today:(-1):(today-baselinePeriod+1)),1);
    matX2 = actualData((today:(-1):(today-baselinePeriod+1)),2:5);
    matX3 = [matX1,matX2];
    b = regress(recentData, matX3);
    predCount = ([1 (baselinePeriod+1), actualData((today+1),2:5)])*b;

    %Standardize tomorrow's residual and update the CUSUM
    residuals(tomorrow) = actualData((tomorrow),1)-predCount;
    stdDev = std(residuals(tomorrow:(-1):tomorrow-startupPeriod));
    stdResidual = residuals(tomorrow)/stdDev;
    stdResidualVector(today) = stdResidual;
    cusum = max(0,(stdResidual-k+cusum));
    %cusum
    today = today+1;
    tomorrow = today+1;
end
APPENDIX C.  STANDARDIZED RESIDUAL PLOTS AND QQ
PLOTS OF HISTORICAL DATA


To plot residuals in Matlab:

>X=stdResidualVector(45:119)  %with baseline 35 days and "startup" 10 days; 119 = 1/3 of 1.5yrs
>plot(X)
>xlabel('time')
>ylabel('standardized residuals')
>title('time series plot of stdResiduals')


CDC1 “Base Case”
>mean(X) = 0.0429
>std(X) = 1.1036

[Figure: QQ Plot of Sample Data versus Standard Normal; horizontal axis: Standard Normal Quantiles]
MCHD1 “Variant 1A”
>mean(X) = 0.0211
>std(X) = 1.0073

[Figure: QQ plot; horizontal axis: Standard Normal Quantiles]
MCHD2 “Variant 2A”
>mean(X) = 0.0068
>std(X) = 1.0987

[Figure: QQ Plot of Sample Data versus Standard Normal]
APPENDIX D.  R CODE TO ESTIMATE CUSUM THRESHOLD


The  output  is  the  estimated  ATFS  and  its  standard  error. 

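IC.ARL.EST.func <- function(x,h,k){
    # x = number of simulated runs, h = threshold, k = reference interval
    runs <- rep(0,x)
    CUSUM <- 0
    for(i in 1:x){
        # Count observations until the CUSUM first reaches h
        while(CUSUM<h){
            CUSUM <- max(0, (CUSUM + rnorm(1)-k))
            runs[i] <- runs[i] + 1
        }
        CUSUM <- 0   # reset for the next run
    }
    # Estimated ATFS and its standard error
    print(c(mean(runs),sd(runs)/sqrt(x)))
}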

To get an ATFS=5 (s.e.=0.0045) with k=0.56, set h=0.296.  Here's the output:

> IC.ARL.EST.func(1000000,0.296,0.56)
[1] 5.004290000 0.004450843

To get an ATFS=20 (s.e.=0.02) with k=1.06, set h=0.62.  Here's the output:

> IC.ARL.EST.func(1000000,0.622,1.06)
[1] 20.01101900 0.01944600